LARGE MARGIN FILTERING FOR SIGNAL SEQUENCE LABELING 

Remi Flamary, Benjamin Labbe, Alain Rakotomamonjy 

LITIS EA 4108, INSA-Universite de Rouen 
76801 Saint-Etienne-du-Rouvray, France 



ABSTRACT 

Signal sequence labeling consists of predicting a sequence of
labels given an observed sequence of samples. A naive approach
is to filter the signal in order to reduce the noise and then to
apply a classification algorithm to the filtered samples. In this
paper, we propose to learn the filter jointly with the classifier,
leading to a large-margin filtering for classification. This method
makes it possible to learn the optimal cutoff frequency and the
phase of the filter, which may be different from zero. Two methods
are proposed and tested on a toy dataset and on a real-life BCI
dataset from BCI Competition III.

Index Terms: Filtering, SVM, BCI, Sequence Labeling

1. INTRODUCTION 

The aim of signal sequence labeling is to assign a label to each 
sample of a multichannel signal while taking into account the 
sequentiality of the samples. This problem typically arises in 
speech signal segmentation or in Brain Computer Interfaces 
(BCI). Indeed, in real-time BCI applications, each sample of 
an electro-encephalography signal has to be interpreted as a 
specific command for a virtual keyboard or a robot, hence the
need for sample labeling [1, 2].

Many methods and algorithms have already been proposed
for signal sequence labeling. For instance, Hidden
Markov Models (HMM) [3] are statistical models that are
able to learn a joint probability distribution of samples in a
sequence and their labels. In some cases, Conditional Random
Fields (CRF) [4] have been shown to outperform the
HMM approach, as they do not assume that the observations
are independent. Structural Support Vector Machines
(StructSVM), which are SVMs that learn a mapping from structured
input to structured output, have also been considered for signal
segmentation [5]. Signal sequence labeling can also be
viewed from a very different perspective by considering a
change detection method coupled with a supervised classifier.
For instance, a Kernel Change Detection algorithm [6] can be
used for detecting abrupt changes in a signal, after which
a classifier is applied to label the segmented regions.

In order to preprocess the signal, a filter is often applied
and the resulting filtered samples are used as training
examples for learning. Such an approach raises the issue of
the filter choice, which is often based on prior knowledge
about the information carried by the signals. Moreover, measured
signals and extracted features may not be in phase with
the labels, and a time-lag due to the acquisition process appears
in the signals. For example, in the problem of decoding
arm movements from brain signals, there exists a natural time
shift between these two entries; hence, in their work, Pistohl
et al. [7] had to select a delay in their signal processing
method by validation.

In this work, we address the problem of automated tuning
of the filtering stage, including its time-lag. Indeed, our
objective is to adapt the preprocessing filter and all its properties
by including its setting in the learning process. Our hypothesis
is that by properly fitting the filter to the classification
problem at hand, without relying on ad hoc prior knowledge, we
should be able to considerably improve the sequence labeling
performance. We therefore propose to take the temporal
neighborhood of the current sample directly into account in the
decision function and the learning process, leading to an automatic
setting of the signal filtering.

For this purpose, we first propose a naive approach based
on SVMs, which consists in considering, instead of a given
time sample, a time-window around the sample. This method,
named Window-SVM, allows us to learn a spatio-temporal
classifier that will adapt itself to the signal time-lag. Then, we
introduce another approach, denoted Filter-SVM, which
dissociates the filter and the classifier. This novel method jointly
learns an SVM classifier and FIR filter coefficients. By doing
so, we can interpret our filter as a large-margin filter for
the problem at hand. These two methods are tested on a toy
dataset and on a real-life BCI signal sequence labeling problem
from BCI Competition III [1].

2. LARGE MARGIN FILTER 
2.1. Problem definition 

Our concern is a signal sequence labeling problem: we want
to obtain a sequence of labels from a multichannel time-sampled
signal or from multi-channel features extracted
from that signal. We suppose that the training samples are
gathered in a matrix $X \in \mathbb{R}^{N \times d}$ containing $d$ channels and
$N$ samples. $X_{i,j}$ is the value of channel $j$ for the $i$-th sample.
The vector $y \in \{-1, 1\}^N$ contains the class of each sample.

In order to reduce noise in the samples or variability in
the features, a usual approach is to filter X before the
classifier learning stage. In the literature, all channels are usually
filtered with the same filter (Savitzky-Golay for instance in [7]),
although there is no reason for a single filter to be optimal
for all channels. Let us define the filters applied to X by the
matrix $F \in \mathbb{R}^{f \times d}$. Each column of F is a filter for the
corresponding channel in X and $f$ is the size of the FIR filters.

We define the filtered data matrix $\tilde{X}$ by:

$$\tilde{X}_{i,j} = \sum_{m=1}^{f} F_{m,j} X_{i+1-m+n_0,\,j} \qquad (1)$$

where the sum is a unidimensional convolution of each channel
by the filter in the appropriate column of F. $n_0$ is the
delay of the filter; for instance, $n_0 = 0$ corresponds to a causal
filter and $n_0 = f/2$ corresponds to a filter centered on the
current sample.
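
As an illustration, the per-channel filtering of Equation (1) can be sketched in a few lines of Python. This is a minimal sketch and not the Matlab code mentioned in this paper: the function name `fir_filter` is hypothetical, indices are 0-based, and samples falling outside the signal are assumed to be zero, a boundary convention that the text does not specify.

```python
def fir_filter(X, F, n0=0):
    """Apply one FIR filter per channel, following Eq. (1) with 0-based indices.

    X  : list of N rows, each a list of d channel values
    F  : list of f rows, each a list of d coefficients (column j filters channel j)
    n0 : delay of the filter (0 gives a causal filter)
    Samples outside the signal are treated as zero (an assumption).
    """
    N, d, f = len(X), len(X[0]), len(F)
    out = [[0.0] * d for _ in range(N)]
    for i in range(N):
        for j in range(d):
            acc = 0.0
            for m in range(f):
                k = i - m + n0  # 0-based version of the index i + 1 - m + n0
                if 0 <= k < N:
                    acc += F[m][j] * X[k][j]
            out[i][j] = acc
    return out


# A 3-tap average filter on a single channel, centered with n0 = 1.
X = [[1.0], [2.0], [3.0], [4.0]]
avg = [[1 / 3], [1 / 3], [1 / 3]]
centered = fir_filter(X, avg, n0=1)
```

With `n0=1` the three taps cover the samples i+1, i and i-1, so the output at an interior sample is the mean of its immediate neighborhood; with `n0=0` only the current and past samples are used.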

2.2. Windowed-SVM (W-SVM) 

As highlighted by Equation (1), a filtering stage essentially
consists in taking into account, for a given time $i$, instead of
the sample $X_{i,\cdot}$, a linear combination of its temporal
neighborhood. However, instead of introducing a filter F, it is
possible to consider for classification a temporal window around
the current sample. Such an approach would lead to this
decision function for the $i$-th sample of X:

$$f_W(i, X) = \sum_{m=1}^{f} \sum_{j=1}^{d} W_{m,j} X_{i+1-m+n_0,\,j} + w_0 \qquad (2)$$

where $W \in \mathbb{R}^{f \times d}$ and $w_0 \in \mathbb{R}$ are the classification
parameters and $f$ is the size of the time-window. Note that W plays
the role of both the filter and the weights of a linear classifier. In a
large-margin framework, W and $w_0$ may be learned by
minimizing this functional:

$$J_{WSVM}(W) = \frac{1}{2} \|W\|_F^2 + \frac{C}{2} \sum_{i=1}^{N} H(y, X, f_W, i)^2 \qquad (3)$$

where $\|W\|_F^2 = \sum_{i,j} W_{i,j}^2$ is the squared Frobenius norm of
W, C is a regularization term to be tuned and $H(y, X, f, i) =
\max(0, 1 - y_i f(i, X))$ is the SVM hinge loss. By vectorizing
X and W appropriately, problem (3) may be transformed into
a linear SVM. Hence, we can take advantage of many linear
SVM solvers existing in the literature, such as the one
proposed by Chapelle [8]. By using that solver, Window-SVM
complexity is about $O(N (f d)^2)$, which scales quadratically
with the filter dimension.
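
The vectorization mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`window_features`, `decision`, `squared_hinge_objective`) are hypothetical, the window is zero-padded at the signal edges, W is flattened row by row (tap m first, then channel j), and the objective is assumed to use the 1/2 and C/2 scalings of a squared-hinge linear SVM. Any linear SVM solver could then be trained on the stacked rows.

```python
def window_features(X, f, n0=0):
    """Stack, for each sample i, the f neighbours X[i - m + n0] (m = 0..f-1)
    of every channel into one vector of length f * d (zero-padded at the edges)."""
    N, d = len(X), len(X[0])
    rows = []
    for i in range(N):
        row = []
        for m in range(f):
            k = i - m + n0
            row.extend(X[k] if 0 <= k < N else [0.0] * d)
        rows.append(row)
    return rows


def decision(row, w_flat, w0):
    """Linear decision value: dot product with the flattened W plus the bias."""
    return sum(a * b for a, b in zip(row, w_flat)) + w0


def squared_hinge_objective(rows, y, w_flat, w0, C):
    """0.5 * ||W||_F^2 + C/2 * sum of squared hinge losses (assumed scaling)."""
    reg = 0.5 * sum(v * v for v in w_flat)
    loss = sum(max(0.0, 1.0 - yi * decision(r, w_flat, w0)) ** 2
               for r, yi in zip(rows, y))
    return reg + 0.5 * C * loss
```

Each stacked row has f·d entries, which is the dimension that drives the quadratic cost quoted above.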



The matrix W weights the importance of each sample
value $X_{i,j}$ in the decision function. Hence, channels may
have different weights and time-lags. Indeed, W will automatically
adapt to a phase difference between the sample labels
and the channel signals. However, since this method treats space
and time independently, W does not take into account
the multi-channel structure and the sequentiality of the
samples. Since the samples of a given channel are known
to be time-dependent due to the underlying physical process,
it seems preferable to process them with a filter and to classify
the filtered samples. We therefore propose in the sequel another
method that jointly learns the time-filtering and a linear
classifier on the filtered samples defined by Eq. (1).

2.3. Large margin filtering (Filter-SVM) 

We propose to find the filter F that maximizes the margin of
the linear classifier for the filtered samples. In this case, the
decision function is:

$$f_F(i, X) = \sum_{m=1}^{f} \sum_{j=1}^{d} w_j F_{m,j} X_{i+1-m+n_0,\,j} + w_0 \qquad (4)$$

where $w$ and $w_0$ are the parameters of the linear SVM classifier,
corresponding to a weighting of the channels. By dissociating
the filter and the decision function weights, we expect
that some useless channels (non-informative or too noisy)
get small weights in the decision function. Indeed, due to
the double weighting $w_j$ and $F_{\cdot,j}$, and the specific channel
weighting role played by $w_j$, this approach is able to perform
channel selection, as shown in the experimental section.
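
One way to read Equation (4), using only the definitions above, is to regroup the double sum by channel. Writing $\tilde{X}$ for the filtered data of Equation (1), the following worked step shows that Filter-SVM is a linear classifier on the filtered samples, and equivalently a Window-SVM whose weight matrix is constrained to a product form:

```latex
f_F(i, X)
  = \sum_{j=1}^{d} w_j \Big( \sum_{m=1}^{f} F_{m,j}\, X_{i+1-m+n_0,\,j} \Big) + w_0
  = \sum_{j=1}^{d} w_j\, \tilde{X}_{i,j} + w_0 ,
\qquad \text{i.e. } W_{m,j} = w_j\, F_{m,j} .
```

Under this reading, setting $w_j = 0$ removes channel $j$ from the decision function whatever its filter, which is consistent with the channel selection behavior described above.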

The decision function given in Equation (4) can be obtained
by minimizing:

$$J_{FSVM} = \frac{1}{2} \|w\|^2 + \frac{C}{2} \sum_{i=1}^{N} H(y, X, f_F, i)^2 + \frac{\lambda}{2} \|F\|_F^2 \qquad (5)$$

w.r.t. $(F, w, w_0)$, where $\|\cdot\|_F$ is the Frobenius norm and $\lambda$
is a regularization term to be tuned. Note that without the
regularization term $\|F\|_F$, the problem is ill-posed. Indeed,
in such a case, one can always decrease $\|w\|^2$ while keeping
the empirical hinge loss constant by multiplying $w$ by $\alpha < 1$
and $F$ by $1/\alpha$.
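
The rescaling argument can be written out explicitly. For a positive scalar (written $\alpha$ here, with $0 < \alpha < 1$), replacing $w$ by $\alpha w$ and $F$ by $F/\alpha$ leaves every product $w_j F_{m,j}$, and hence the decision function and the hinge loss, unchanged, while the norm of $w$ shrinks and the norm of $F$ grows:

```latex
(\alpha w_j)\Big(\tfrac{1}{\alpha} F_{m,j}\Big) = w_j F_{m,j},
\qquad
\|\alpha w\|^2 = \alpha^2 \|w\|^2 < \|w\|^2,
\qquad
\Big\|\tfrac{1}{\alpha} F\Big\|_F^2 = \tfrac{1}{\alpha^2} \|F\|_F^2 > \|F\|_F^2 .
```

A penalty on the Frobenius norm of $F$ therefore makes this rescaling costly, which is what removes the degeneracy.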

The cost defined in Equation (5) is differentiable and
provably non-convex when jointly optimized with respect to
all parameters. However, $J_{FSVM}$ is differentiable and convex
with respect to $w$ and $w_0$ when F is fixed, as it corresponds
to a linear SVM with squared hinge loss. Hence, for a given
value of F, we can define

$$J(F) = \min_{w, w_0} \frac{1}{2} \|w\|^2 + \frac{C}{2} \sum_{i=1}^{N} H(y, X, f_F, i)^2$$

which according to Bonnans et al. [9] is differentiable. Then,
if $w^*$ and $w_0^*$ are the optimal values for a given $F^*$, the
gradient of the second term of $J(\cdot)$ with respect to F at the point
$F^*$ is:

$$\nabla_{F_{m,j}} J(F^*) = -C \sum_{i=1}^{N} y_i \left( w_j^* X_{i-m+1+n_0,\,j} \right) \times H(y, X, f_{F^*}, i)$$


Now, since J(F) is differentiable and since its value can be
easily computed by a linear SVM, we choose to learn the
decision function by minimizing $J(F) + \frac{\lambda}{2} \|F\|_F^2$ with respect
to F instead of minimizing problem (5). Note that due to the
non-convexity of the objective function in problem (5), these two
minimization problems are not strictly equivalent, but our
approach has the advantage of taking into account the intrinsic
large-margin structure of the problem.
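
The gradient with respect to F can be checked numerically with a short sketch. The code below is an illustration only: the function names are hypothetical, the classifier parameters (w, w0) are simply given rather than obtained from an SVM solver, the squared hinge term is assumed to be scaled by C/2, and samples outside the signal are assumed to be zero.

```python
def decision_f(X, F, w, w0, n0, i):
    """Filter-SVM decision value for sample i (0-based, zero-padded edges)."""
    N, d, f = len(X), len(w), len(F)
    total = w0
    for m in range(f):
        k = i - m + n0
        if 0 <= k < N:
            total += sum(w[j] * F[m][j] * X[k][j] for j in range(d))
    return total


def data_term(X, y, F, w, w0, n0, C):
    """C/2 * sum_i max(0, 1 - y_i f_F(i, X))^2 for fixed (w, w0)."""
    return 0.5 * C * sum(
        max(0.0, 1.0 - y[i] * decision_f(X, F, w, w0, n0, i)) ** 2
        for i in range(len(X)))


def grad_F(X, y, F, w, w0, n0, C):
    """Analytic gradient of data_term with respect to every F[m][j]."""
    N, d, f = len(X), len(w), len(F)
    G = [[0.0] * d for _ in range(f)]
    for i in range(N):
        h = max(0.0, 1.0 - y[i] * decision_f(X, F, w, w0, n0, i))
        if h == 0.0:
            continue  # samples with zero hinge loss do not contribute
        for m in range(f):
            k = i - m + n0
            if 0 <= k < N:
                for j in range(d):
                    G[m][j] -= C * y[i] * w[j] * X[k][j] * h
    return G
```

Comparing `grad_F` with central finite differences of `data_term` on a small random problem is a convenient sanity check before plugging the gradient into a descent loop.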

Algorithm 1 Filter-SVM solver

  Set $F_{l,k} = 1/f$ for $k = 1, \dots, d$ and $l = 1, \dots, f$
  repeat
    $D_F \leftarrow$ gradient of $J_{FSVM}$ with respect to $F$
    $(F, w^*, w_0^*) \leftarrow$ Line-Search along $D_F$
  until Stopping criterion is reached

Fig. 1. Histograms of both labels with and without filtering
(vertical axes are different) for a 1-channel signal with $\sigma = 1$

For solving the optimization problem, we propose a gradient
descent algorithm along F with a line search method
for finding the optimal step. The method is detailed in
Algorithm 1. Note that at each computation of J(F) in the line
search, the optimal $w^*$ and $w_0^*$ are found by solving a linear
SVM. The iterations of the algorithm may be stopped by two
stopping criteria: a threshold on the relative variation of J(F)
or a threshold on the variation of the norm of F.

Due to the non-convexity of the objective function, it is
difficult to provide an exact evaluation of the complexity of the
solution. However, we know that the gradient computation is of
order $O(N f d)$ and that, each time J(F) is computed in the
line search, an $O(N d^2)$ linear SVM is solved and
an $O(N f d)$ filtering is applied.
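
The outer loop of the solver can be sketched generically. The code below is a simplified illustration, not the implementation used for the experiments: `gradient_descent` is a hypothetical name, the objective and gradient are caller-supplied callables (in Filter-SVM each objective evaluation would involve solving a linear SVM, which is not done here), and the optimal-step line search is replaced by a simple backtracking search with a sufficient-decrease test.

```python
def gradient_descent(objective, gradient, F0, step0=1.0, shrink=0.5,
                     c=1e-4, tol=1e-6, max_iter=200):
    """Minimise objective(F) over a matrix F given as a list of lists."""
    F = [row[:] for row in F0]
    J = objective(F)
    for _ in range(max_iter):
        G = gradient(F)
        g2 = sum(g * g for row in G for g in row)
        step, accepted = step0, False
        while step > 1e-12:  # backtracking line search along -G
            trial = [[a - step * g for a, g in zip(fr, gr)]
                     for fr, gr in zip(F, G)]
            J_trial = objective(trial)
            if J_trial <= J - c * step * g2:  # sufficient decrease
                accepted = True
                break
            step *= shrink
        if not accepted:
            break  # no acceptable step found: treat F as converged
        rel_change = (J - J_trial) / max(abs(J), 1e-12)
        F, J = trial, J_trial
        if rel_change < tol:
            break  # stopping criterion: relative variation of the objective
    return F, J
```

Following Algorithm 1, the starting point `F0` would be the averaging filter with all coefficients equal to 1/f; the stopping test shown here corresponds to the first of the two criteria mentioned above.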



3. RESULTS 



3.1. Toy Example 



We use a toy example that consists of $nb_{tot}$ channels, only
$nb_{rel}$ of them being discriminative. Discriminative channels
have a switching mean in $\{-1, 1\}$ controlled by the label and
are corrupted by Gaussian noise of standard deviation $\sigma$. The
length of the regions with constant label is uniformly distributed
between 30 and 40 samples, and different time-lags are
applied to the channels. We selected $f = 21$ and $n_0 = 11$,
corresponding to a good average filtering centered on the
current sample. Figure 1 shows how the samples are transformed
by the filter F for a unidimensional signal. In this case,
the mean test error due to the noise is 16% for the unfiltered
signal, but only 2% for the optimally filtered signal.
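
A generator for data of this kind can be sketched as follows. Several details are assumptions made for the illustration and are not specified in the text: the function name `make_toy`, the range of the time-lags (`max_lag`), the direction of the lag (each channel follows the labels with a delay), and the choice of zero-mean noise for the non-discriminative channels.

```python
import random


def make_toy(n_samples, nb_tot, nb_rel, sigma, max_lag=5, seed=0):
    """Return (X, y): X has n_samples rows of nb_tot channels, y holds +/-1 labels."""
    rng = random.Random(seed)
    # Labels: constant regions whose length is uniform in [30, 40] samples.
    y, label = [], rng.choice([-1, 1])
    while len(y) < n_samples:
        y.extend([label] * rng.randint(30, 40))
        label = -label
    y = y[:n_samples]
    # One time-lag per discriminative channel (the range is an assumption).
    lags = [rng.randint(0, max_lag) for _ in range(nb_rel)]
    X = []
    for i in range(n_samples):
        row = []
        for j in range(nb_tot):
            mean = 0.0  # non-discriminative channels carry noise only
            if j < nb_rel:
                k = i - lags[j]  # channel j lags behind the labels
                mean = float(y[max(k, 0)])
            row.append(mean + rng.gauss(0.0, sigma))
        X.append(row)
    return X, y


# Example: 1000 samples, 30 channels of which 3 are discriminative.
X_toy, y_toy = make_toy(1000, nb_tot=30, nb_rel=3, sigma=3.0)
```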

Window-SVM and Filter-SVM are compared to SVM
without filtering, SVM with an average filter of size $f$
(Avg-SVM) and HMM with Viterbi decoding. The regularization
parameters are selected by a validation method. The signals
contain 1000 samples for the learning and the
validation sets and 5000 samples for the test set. All the
processes are run ten times, and the test error is the average
over the runs.

Fig. 2. Test error for different $\sigma$ values ($nb_{tot} = 30$,
$nb_{rel} = 3$, on the left) and for different numbers of channels
$nb_{tot}$ ($\sigma = 3$, $nb_{rel} = 3$, on the right)



Fig. 3. Coefficients of W (left, panel "Win-SVM Filters") and
coefficients of F weighted by w (right, panel "Filter-SVM Filters")
for $nb_{rel} = 3$, $nb_{tot} = 30$, $\sigma = 3$

The methods are compared for different $\sigma$ values with
($nb_{tot} = 30$, $nb_{rel} = 3$). The test error is plotted on the
left of Figure 2. We can see that only Avg-SVM, Window-SVM
and Filter-SVM adapt to time-lags between the channels
and the labels. Both Window-SVM and Filter-SVM outperform
the other methods, although under heavy noise the latter
seems to be slightly better. Then we test our methods for
a varying number of channels in order to see how dimension
is handled ($nb_{rel} = 3$, $\sigma = 3$). Figure 2 (right) shows the
interest of Filter-SVM over Window-SVM in high dimension,
as we can see that Window-SVM tends to lose its efficiency and
even becomes similar to Avg-SVM. This comes from the fact that
Filter-SVM can more efficiently perform a channel selection
thanks to the weighting of w. Figure 3 shows the filters
returned by both methods. We observe that only the coefficients
of the relevant signals are important and that the other signals
tend to be eliminated by small weights for Filter-SVM,
explaining the better results in high dimension.

Table 1. Test Error for BCI Dataset

  Setting               Subject 1   Subject 2   Subject 3   Average
  ...                   ...         ...         ...         0.2550
  f = 100, n_0 = 50     0.0537      0.1659      0.3859      0.2018

Fig. 4. F filters (subject 1) for label 1 against all (left) and
label 2 against all (right).


3.2. BCI Dataset 

We test our method on the BCI Dataset from BCI Competition
III [1]. The problem is to obtain a sequence of labels from
brain activity signals for 3 human subjects. The data consist
of 96 channels containing PSD features (3 training sessions,
1 test session, $N \approx 3000$ per session), and the problem has
3 labels (left arm, right arm or feet).

We use Filter-SVM, which showed better results in high
dimension for the toy example. The multi-class aspect of the
problem is handled by using a One-Against-All strategy. The
regularization parameters are tuned using a grid search validation
method on the third training set. We compare our method
to the best BCI competition results (using only 8 samples) and
to the SVM without filtering. Test errors for different filter sizes
$f$ and delays $n_0$ are given in Table 1. Results show that
one can drastically improve the results by using longer filtering
with causal filters ($n_0 = 0$). Note that Filter-SVM outperforms
Avg-SVM with a centered filter.
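
The One-Against-All strategy can be illustrated by a short sketch. The helper names are hypothetical, and the rule used to combine the binary classifiers (assigning each sample to the class with the largest decision value) is an assumption, since the combination rule is not detailed here.

```python
def one_vs_all_labels(labels, classes):
    """For each class c, build the +1/-1 target vector 'c against the rest'."""
    return {c: [1 if lab == c else -1 for lab in labels] for c in classes}


def predict_one_vs_all(scores):
    """scores maps class -> list of per-sample decision values.

    Each sample is assigned to the class with the largest decision value.
    """
    classes = list(scores)
    n = len(scores[classes[0]])
    return [max(classes, key=lambda c: scores[c][i]) for i in range(n)]


# Three classes, as in the BCI problem (the label values are placeholders).
targets = one_vs_all_labels([1, 2, 3, 1], classes=[1, 2, 3])
```

One binary Filter-SVM would then be trained per entry of `targets`, each with its own filter matrix F.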

Another advantage of this method is that one can visualize
a discriminative space-time map (channel selection, shape of
the filter and delays). We show for instance in Figure 4 the
discriminative filters F obtained for subject 1, and we can see
that the filtering is extremely different depending on the task.

The Matlab code corresponding to these results will be 
provided on our website for reproducibility. 

4. CONCLUSIONS 

We have proposed two methods for automatically learning a 
spatio-temporal filter used for multi-channel signal classifica- 
tion. Both methods have been tested on a toy example and on 
a real life dataset from BCI Competition III. 

Empirical results clearly show the benefits of adapting the 
signal filter to the large-margin classification problem despite 
the non-convexity of the criterion. 

In future work, we plan to extend our approach to the
nonlinear case: we believe that a differentiable kernel can be used
instead of inner products, at the cost of solving the SVM in
the dual space. Another perspective would be to adapt our
methods to the multi-task situation, where one wants to jointly
learn one matrix F and several classifiers (one per task).